1 Introduction
Linkage maps are essential tools in several genetic studies such as quantitative trait loci (QTL) analysis, understanding chromosomal evolutionary processes, assessment of collinearity between genomes, and study of meiotic processes. The principle behind linkage map construction is detecting recombinant events between genomic positions and summarizing them into pairwise recombination fractions. In diploids, the assessment of such a phenomenon is relatively straightforward. After the homolog duplication, sister chromatids pair up exchanging segments. The presence of informative markers, e. g. single nucleotide polymorphisms (SNPs) enable estimating the recombination fraction between pairs of genomic positions by comparing the chromosome constitution of parents and offspring.
In polyploids (species with more than two sets of homolog chromosomes, or homologs), the construction of such maps is more challenging. While in diploids a biallelic marker is fully informative to distinguish between both alleles present in a pair of homologs, in polyploids they allow accounting for proportions of allelic dosages. The limited information imposed by this system makes the assessment of recombination events during the meiosis troublesome, specially as the ploidy level gets larger. To recover the multiallelic information present in polyploid species, we need to account for recombination frequencies, estimate phase configurations and reconstruct the haplotypes of both parents and individuals in the population.
MAPpoly is an R package fully capable of building genetic linkage maps for biparental populations in polyploid species with ploidy level up to 8x. It was developed as part of the Genomic Tools for Sweetpotato Improvement (GT4SP), funded by Bill and Melinda Gates Foundation. All of the statistical procedures used in the functions presented in this document are detailed in Mollinari and Garcia (2019) and Mollinari et al. (2020). The results obtained with MAPpoly can be readily used for QTL mapping with the QTLpoly package, which implements the procedures proposed by da Silva Pereira et al. (2020).
Some characteristics of MAPpoly (v 0.2.1)
- Can handle multiple dataset types
- Can handle a large number of markers. See this publication for an example of ~30,000 SNPs in hexaploid sweetpotato.
- Incorporates genomic information
- Explores multipoint information (through Hidden Markov Model)
- Can reconstruct haplotypes for parents and all individuals in the population
- Recovers the multiallelic nature of polyploid genomes
- Detects occurrence, location and frequency of multivalent pairing during meiosis
- Robust enough to build genetic linkage maps with multiallelic markers (when available - see function merge_maps)
1.1 MAPpoly installation
1.2 From CRAN (stable version)
To install MAPpoly from the The Comprehensive R Archive Network (CRAN) use
install.packages("mappoly")
1.3 From GitHub (development version)
You can install the development version from Git Hub. Within R, you need to install devtools:
install.packages("devtools")
If you are using Windows, you must install the the latest recommended version of Rtools.
To install MAPpoly from Git Hub use
devtools::install_github("mmollina/mappoly", dependencies=TRUE)
2 Loading datasets
In its current version, MAPpoly can handle the following types of datasets: 1. CSV files 2. MAPpoly files - Dosage based - Probability based 3. fitPoly files 4. VCF files
MAPpoly also is capable of importing objects generated by the following R packages
Both CSV and MAPpoly datasets are sensible to formatting errors, such as additional spaces, commas and wrong encoding (non-UTF-8). If you have any trouble, please double check your files before submitting an issue. You can find detailed steps of all supported files in the following sections. Also, in large datasets, it is expected that a considerable proportion of markers will contain redundant information. Thus, all reading functions will set these redundant markers aside to be included in the final map.
2.1 Reading CSV files
The preparation of a CSV file for MAPpoly can be done in Microsoft Excel or any other spreadsheet software of your preference. In this file, each line contains a marker and each column contains information about the marker. In its current version, MAPpoly can handle .csv files with allelic dosage data.
The first line of the CSV file should contain headers for all columns. The first five columns should include the following information: marker name, dosage for both parents (one column for each), a sequence number (e.g. a chromosome number, if available) and a sequence position (e.g. the marker position within the chromosome, if available). In addition to these five headers, you should include the name of all individuals in the population. From the second line onwards, all columns should contain its values, including allelic dosages for all individuals. Missing or absent values should be represented by NA.
NOTE: If genomic information is not available, the ‘sequence’ and ‘sequence position’ columns should be filled with NA’s.
Example:
(#fig:data set_example)Example of CSV data set
You can read CSV files with the read_geno_csv function:
## Download CSV file from GitHub
ft <- "https://raw.githubusercontent.com/mmollina/SCRI/main/data/B2721_dose.csv"
tempfl <- tempfile()
download.file(ft, destfile = tempfl)
## Reading CSV file
dat.dose.csv <- read_geno_csv(file.in = tempfl, ploidy = 4)
#> Reading the following data:
#> Ploidy level: 4
#> No. individuals: 156
#> No. markers: 7093
#> No. informative markers: 7093 (100%)
#> This dataset contains sequence information.
#> ...
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Done with filtering.
unlink(tempfl)
In addition to the CSV file path, you must indicate the ploidy level using the ploidy argument. This function automatically excludes uninformative markers. It also performs chi-square tests for all markers, considering the expected segregation under Mendelian inheritance, random chromosome pairing and no double reduction. You can optionally use the filter.non.conforming logical argument (default = TRUE), which excludes non-expected genotypes under these assumptions. However, keep in mind that the linkage analysis functions in MAPpoly do not support double-reduction and this can cause software failures.
2.2 Reading MAPpoly files
MAPpoly can also handle two dataset types that follow the same format: (1) a genotype-based file (with allelic dosages) and (2) probability-based file. Both are text files with the same header, but with different genotype table formats.
For both files, the header should contain: ploidy level, number of individuals (nind), number of markers (nmrk), marker names (mrknames), individual names (indnames), allele dosages for parent 1 (dosageP), allele dosages for parent 2 (dosageQ), sequence/chromosome information (seq), position of each marker (seqpos), number of phenotypic traits (nphen) and the phenotypic data (pheno) if available. The header should be organized according to this example:
ploidy 4
nind 3
nmrk 5
mrknames M1 M2 M3 M4 M5
indnames Ind1 Ind2 Ind3
dosageP 0 2 0 0 3
dosageQ 1 2 1 1 3
seq 1 1 2 2 3
seqpos 100 200 50 150 80
nphen 0
pheno-----------------------
geno------------------------
For more information about MAPpoly file format, please see ?read_geno and ?read_geno_prob documentation from MAPpoly package.
2.2.1 Using read_geno
The header should be followed by a table containing the genotypes (allele dosages) for each marker (rows) and for each individual (columns), as follows:
| Individual 1 | Individual 2 | Individual 3 | |
|---|---|---|---|
| Marker 1 | 1 | 0 | 0 |
| Marker 2 | 3 | 0 | 2 |
| Marker 3 | 1 | 0 | 0 |
| Marker 4 | 1 | 0 | 0 |
| Marker 5 | 3 | 4 | 4 |
The final file should look like the example below:
ploidy 4
nind 3
nmrk 5
mrknames M1 M2 M3 M4 M5
indnames Ind1 Ind2 Ind3
dosageP 0 2 0 0 3
dosageQ 1 2 1 1 3
seq 1 1 2 2 3
seqpos 100 200 50 150 80
nphen 0
pheno-----------------------
geno------------------------
1 0 0
3 0 2
1 0 0
1 0 0
3 4 4
Then, use the read_geno function to read your file:
## Download dataset from GitHub
fl = "https://raw.githubusercontent.com/mmollina/SCRI/main/data/B2721_mappoly_dose"
tempfl <- tempfile()
download.file(fl, destfile = tempfl)
## Reading
dat.dose.mpl <- read_geno(file.in = tempfl, elim.redundant = TRUE)
#> Reading the following data:
#> Ploidy level: 4
#> No. individuals: 156
#> No. markers: 6502
#> No. informative markers: 6502 (100%)
#> This dataset contains sequence information.
#> ...
#>
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Performing chi-square test.
#> ...
#> Done.
unlink(tempfl)
2.2.2 Using read_geno_prob
Following the same header described before, read_geno_prob reads a table containing the probability distribution for each combination of marker \(\times\) individual. Each line on this table represents the combination of one marker with one individual, and the respective probabilities. The first two columns represent the marker and the individual, respectively, and the remaining elements represent the probability associated with each one of the possible dosages
| Marker | Individual | \(p(d=0)\) | \(p(d=1)\) | \(p(d=2)\) | \(p(d=3)\) | \(p(d=4)\) |
|---|---|---|---|---|---|---|
| M1 | Ind1 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 |
| M2 | Ind1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| M3 | Ind1 | 0.3 | 0.7 | 0.0 | 0.0 | 0.0 |
| M4 | Ind1 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 |
| M5 | Ind1 | 0.0 | 0.0 | 0.0 | 0.9 | 0.1 |
| M1 | Ind2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| M2 | Ind2 | 0.2 | 0.5 | 0.3 | 0.0 | 0.0 |
| M3 | Ind2 | 0.9 | 0.1 | 0.0 | 0.0 | 0.0 |
| M4 | Ind2 | 0.9 | 0.1 | 0.0 | 0.0 | 0.0 |
| M5 | Ind2 | 0.0 | 0.0 | 0.0 | 0.2 | 0.8 |
| M1 | Ind3 | 0.2 | 0.8 | 0.0 | 0.0 | 0.0 |
| M2 | Ind3 | 0.4 | 0.6 | 0.0 | 0.0 | 0.0 |
| M3 | Ind3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| M4 | Ind3 | 0.0 | 0.1 | 0.9 | 0.0 | 0.0 |
| M5 | Ind3 | 0.1 | 0.9 | 0.0 | 0.0 | 0.0 |
Notice that each marker \(\times\) individual combination have \(ploidy + 1\) probabilities, i.e., from zero to ploidy level. Also, each line must sum to 1. The final file (header + table) should look like the following example:
ploidy 4
nind 3
nmrk 5
mrknames M1 M2 M3 M4 M5
indnames Ind1 Ind2 Ind3
dosageP 0 2 0 0 3
dosageQ 1 2 1 1 3
seq 1 1 2 2 3
seqpos 100 200 50 150 80
nphen 0
pheno-----------------------
geno------------------------
M1 Ind1 0.5 0.5 0.0 0.0 0.0
M2 Ind1 0.0 1.0 0.0 0.0 0.0
M3 Ind1 0.3 0.7 0.0 0.0 0.0
M4 Ind1 0.5 0.5 0.0 0.0 0.0
M5 Ind1 0.0 0.0 0.0 0.9 0.1
M1 Ind2 1.0 0.0 0.0 0.0 0.0
M2 Ind2 0.2 0.5 0.3 0.0 0.0
M3 Ind2 0.9 0.1 0.0 0.0 0.0
M4 Ind2 0.9 0.1 0.0 0.0 0.0
M5 Ind2 0.0 0.0 0.0 0.2 0.8
M1 Ind3 0.2 0.8 0.0 0.0 0.0
M2 Ind3 0.4 0.6 0.0 0.0 0.0
M3 Ind3 1.0 0.0 0.0 0.0 0.0
M4 Ind3 0.0 0.1 0.9 0.0 0.0
M5 Ind3 0.1 0.9 0.0 0.0 0.0
To read the dataset use
## Download dataset from GitHub
ft="https://raw.githubusercontent.com/mmollina/SCRI/main/data/B2721_mappoly_prob"
tempfl <- tempfile()
download.file(ft, destfile = tempfl)
## Reading
dat.prob.mpl <- read_geno_prob(file.in = tempfl, prob.thres = 0.95, elim.redundant = TRUE)
#> Reading the following data:
#> Ploidy level: 4
#> No. individuals: 156
#> No. markers: 6502
#> No. informative markers: 6502 (100%)
#> This dataset contains sequence information.
#> ...
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Performing chi-square test.
#> ...
#> Done.
unlink(tempfl)
Important note: as this type of file contains the probability distribution of all possible dosages and it will take longer to read.
You can define the minimum probability value necessary to call a dosage using the prob.thres argument. If the higher probability for a marker and individual passes this threshold, then its associated dosage is used. However, if none of the probabilities reach this threshold, then its dosage is considered missing (NA). Function read_geno_prob also performs the same filtering described in read_geno
2.3 Reading VCF files
MAPpoly can also handle VCF files (>= V.4.0) generated by the most programs, such as TASSEL, GATK, and Stacks. As few of them can handle poliploidy and estimate genotypes in a satisfactory manner, you may use other tools specifically developed to estimate allele dosages. Briefly, these programs use the allele read counts (or intensity) for each marker \(\times\) individual combination, and determines which is the most likely allele dosage. Examples are SuperMASSA, fitPoly, ClusterCall, updog, and PolyRAD. After allele dosage estimation, your VCF file should contain values for the field GT similar to 1/1/1/0 (a triplex marker in an autotetraploid, for example) rather than 1/0. Since MAPpoly uses dosages (or their probabilities) to build genetic maps, we strongly recommend using programs capable of estimating dosages before building the map. fitPoly, PolyRAD, and updog have direct integration with MAPpoly, as will be described in the next sections.
To demonstrate the read_vcf function, let us download an autohexaploid sweetpotato VCF file from the MAPpoly's vignettes repository on Github and read it:
## Download dataset from GitHub
ft <- "https://github.com/mmollina/MAPpoly_vignettes/raw/master/data/BT/sweetpotato_chr1.vcf.gz"
tempfl <- tempfile()
download.file(ft, destfile = tempfl)
## Reading
dat.dose.vcf <- read_vcf(file = tempfl, parent.1 = "PARENT1", parent.2 = "PARENT2", verbose = FALSE)
#> Registered S3 method overwritten by 'vegan':
#> method from
#> rev.hclust dendextend
unlink(tempfl)
Besides the path to your VCF file, you should indicate parent.1 and parent.2 names. Parent names must be exactly the same strings as in the VCF file. The ploidy level will be automatically detected, but you may indicate it using the optional ploidy argument. With this argument, the function will check for possible errors in your dataset. For species with variable ploidy levels (i.e. sugarcane), please indicate the desired ploidy level using the ploidy argument; if absent, MAPpoly will use the smallest ploidy level detected.
This function also has options to filter out undesired markers or data points, such as those with low depth or high proportion of missing data. You can define the following filter arguments: set min.av.depth to remove markers with average depth below the provided value (default = 0); set min.gt.depth to remove data points that present depth below the provided (default = 0); set max.missing to any value between 0 and 1 in order to remove markers that present missing data proportion above this value (default = 1). read_vcf also perform the same filtering described for function read_geno. Object generated by read_vcf has some additional information when compared to previous functions: reference and alternative alleles (bases) for each marker; and average depth of each marker. You can inspect all marker depths using the following code as an example:
library(ggplot2)
dosage_combs = cbind(dat.dose.vcf$dosage.p, dat.dose.vcf$dosage.q)
dc_simplex = apply(dosage_combs,1,function(x) if(all(c(0,1) %in% x) |
all(c(dat.dose.vcf$m-1, dat.dose.vcf$m) %in% x)) return(TRUE) else return(FALSE))
dc_dsimplex = apply(dosage_combs,1,function(x) if(all(x == c(1,1)) |
all(x == c(dat.dose.vcf$m-1, dat.dose.vcf$m-1))) return(TRUE) else return(FALSE))
dc_simplex[which(dc_simplex == TRUE)] = "simplex"
dc_simplex[which(dc_dsimplex == TRUE)] = 'double simplex'
dc_simplex[which(dc_simplex == 'FALSE')] = 'multiplex'
data_depths = data.frame('Marker depths' = dat.dose.vcf$all.mrk.depth,
'Depth classes' = findInterval(dat.dose.vcf$all.mrk.depth, c(200,300,400,500,600,50000)),
'Dosage combinations' = dc_simplex, check.names = F)
ggplot(data_depths, aes(fill=`Dosage combinations`, x=`Depth classes`, y=`Marker depths`)) +
geom_bar(position = 'stack', stat = 'identity') +
scale_x_continuous(breaks=0:5, labels=c("[0,200)","[200,300)","[300,400)","[400,500)","[500,600)", "> 600"))
2.4 Importing data from fitPoly
You can export datasets generated by fitPoly package using
## Downloading and reading B2721 fitPoly's probabilistic scores
address <- "https://github.com/mmollina/SCRI/raw/main/data/fitpoly_tetra_call/B2721_scores.zip"
tempfl <- tempfile(pattern = "B2721_", fileext = ".zip")
download.file(url = address, destfile = tempfl)
unzip(tempfl, files = "B2721_scores.dat")
unlink(tempfl)
dat <- read_fitpoly(file.in = "B2721_scores.dat",
ploidy = 4,
parent1 = "Atlantic",
parent2 = "B1829",
verbose = TRUE)
#> Reading the following data:
#> Ploidy level: 4
#> No. individuals: 156
#> No. markers: 7867
#> No. informative markers: 7093 (90.2%)
#> ...
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Performing chi-square test.
#> ...
#> Done.
The input file should be generated using the function fitPoly::saveMarkerModels. Here is an example of genotype calling using saveMarkerModels. If available, you can also include genome information in the dataset. Here is an example of including the Solanum tuberosum genome v4.03 information.
2.5 Importing data from PolyRAD
The R package PolyRAD has its own function to export genotypes to the MAPpoly’s genotype probability distribution format. One may use the commands above to import from PolyRAD:
# load example dataset from polyRAD
library(polyRAD)
data(exampleRAD_mapping)
exampleRAD_mapping = SetDonorParent(exampleRAD_mapping, "parent1")
exampleRAD_mapping = SetRecurrentParent(exampleRAD_mapping, "parent2")
exampleRAD_mapping = PipelineMapping2Parents(exampleRAD_mapping)
#> Making initial parameter estimates...
#> Generating sampling permutations for allele depth.
#> Updating priors using linkage...
#> Done.
# export to MAPpoly
outfile2 = tempfile()
Export_MAPpoly(exampleRAD_mapping, file = outfile2)
# Read in MAPpoly
mydata_polyrad = read_geno_prob(outfile2)
#> Reading the following data:
#> Ploidy level: 2
#> No. individuals: 100
#> No. markers: 4
#> No. informative markers: 4 (100%)
#> This dataset contains sequence information.
#> ...
#> Done with reading.
#> Filtering non-conforming markers.
#> ...
#> Performing chi-square test.
#> ...
#> Done.
mydata_polyrad
2.6 Importing data from updog
You can use MAPpoly’s function import_from_updog to import any dataset generated by updog’s function multidog, following the example below:
# Load example dataset from updog
library(updog)
data(uitdewilligen)
mout = multidog(refmat = t(uitdewilligen$refmat),
sizemat = t(uitdewilligen$sizemat),
ploidy = uitdewilligen$ploidy,
model = "f1",
p1_id = colnames(t(uitdewilligen$sizemat))[1],
p2_id = colnames(t(uitdewilligen$sizemat))[2],
nc = 4)
mydata_updog = import_from_updog(mout)
Notice that updog removes both sequence and sequence position information that may be present in the VCF file. We highly recommend the incorporation of this information during the linkage map building, when available.
2.7 Importing data from polymapR
Besides genotype calling packages, MAPpoly also is capable to import datasets and phased maps from polymapR. This functionality can be important when the user already have their map constructed using polymapR and wanted to obtain the genotype conditional probabilities using the HMM-based approach to be used in further analysis, e.g., QTL mapping, haplotype reconstruction, meiotic studies, etc. To import the dataset, use the following code
require(polymapR)
#> Loading required package: polymapR
#>
#> Attaching package: 'polymapR'
#> The following objects are masked _by_ '.GlobalEnv':
#>
#> exampleRAD_mapping, mout
data("screened_data3")
mappoly.data <- import_data_from_polymapR(screened_data3, ploidy = 4)
plot(mappoly.data)
Now, let us import the phased map used as example in polymapR
data("integrated.maplist", "marker_assignments_P1","marker_assignments_P2")
maplist <- create_phased_maplist(maplist = integrated.maplist,
dosage_matrix.conv = screened_data3,
marker_assignment.1=marker_assignments_P1,
marker_assignment.2=marker_assignments_P2,
ploidy = 4, verbose = FALSE)
mappoly.maplist <- import_phased_maplist_from_polymapR(maplist, mappoly.data)
plot_map_list(mappoly.maplist, col = "ggstyle")
## plot a segment of phased map (from 0 to 20 cM)
plot(mappoly.maplist[[1]], mrk.names = TRUE, left.lim = 0, right.lim = 20, cex = .7)
2.8 Combining multiple datasets
It is not rare to have multiple datasets from the same population or individuals from different sources of molecular data, such as SNP chips, GBS and/or microsatellites. MAPpoly can combine datasets for the same set of individuals using the function merge_datasets. To demonstrate its functionality, let us download two VCF files (autohexaploid sweetpotato) from the MAPpoly vignettes repository on Github, and read them using read_vcf function:
# Downloading VCF files regarding chromosome 1 and 2
download.file("https://github.com/mmollina/MAPpoly_vignettes/raw/master/data/BT/sweetpotato_chr1.vcf.gz", destfile = 'chr1.vcf.gz')
download.file("https://github.com/mmollina/MAPpoly_vignettes/raw/master/data/BT/sweetpotato_chr2.vcf.gz", destfile = 'chr2.vcf.gz')
data1 = read_vcf(file = 'chr1.vcf.gz', parent.1 = "PARENT1", parent.2 = "PARENT2", verbose = FALSE)
data2 = read_vcf(file = 'chr2.vcf.gz', parent.1 = "PARENT1", parent.2 = "PARENT2", verbose = FALSE)
As we can see, both have different markers for the same population:
# See datasets
print(data1)
print(data2)
Now, let us merge them
# Merge datasets
merged_data = merge_datasets(data1, data2)
print(merged_data, detailed = TRUE)
3 Exploratory Analysis
For the purpose of this tutorial, we will keep using the tetraploid potato array data (loaded using the examples above). We will construct a genetic map of the B2721 population, which is a cross between two tetraploid potato varieties: Atlantic and B1829-5. The population comprises 160 offsprings genotyped with the SolCAP Infinium 8303 potato array. The dataset also contains the genomic order of the SNPs from the Solanum tuberosum genome version 4.03. The genotype calling was performed with fitPoly R package using this pipeline. Another option would be to use ClusterCall and this pipeline.
Once the data is loaded, you can explore the dataset using the print function:
print(dat, detailed = TRUE)
#> This is an object of class 'mappoly.data'
#> Ploidy level: 4
#> No. individuals: 156
#> No. markers: 6531
#> Missing data under 0.95 prob. threshold: 5.26%
#> Redundant markers: 7.92%
#>
#> ----------
#> No. markers per sequence:
#> seq No.mrk
#> 1 661
#> 2 613
#> 3 538
#> 4 672
#> 5 485
#> 6 531
#> 7 543
#> 8 469
#> 9 496
#> 10 364
#> 11 470
#> 12 415
#> ----------
#> Markers with no sequence information: 274
#> ----------
#> No. of markers per dosage combination in both parents:
#> P1 P2 freq
#> 0 1 191
#> 0 2 74
#> 0 3 29
#> 1 0 526
#> 1 1 634
#> 1 2 379
#> 1 3 146
#> 1 4 51
#> 2 0 167
#> 2 1 431
#> 2 2 739
#> 2 3 538
#> 2 4 238
#> 3 0 34
#> 3 1 162
#> 3 2 402
#> 3 3 743
#> 3 4 658
#> 4 1 28
#> 4 2 112
#> 4 3 249
This function shows information about the dataset including the ploidy level, total number of individuals, total number of markers, number of informative markers, proportion of missing data and redundant markers. If detailed = TRUE, the function also outputs the number of markers in each sequence (chromosome, in this case), if available, and the number of markers for dosage combinations between both parents.
You can also explore the dataset visually using the plot function:
plot(dat)
The output figure shows a bar plot on the left-hand side with the number of markers in each allele dosage combination between both parents. The right labels indicate allele dosages for Parent 1 and Parent 2, respectively. The upper-right plot contains the \(\log_{10}(p-value)\) from \(\chi^2\) tests for all markers, considering the expected segregation patterns under Mendelian inheritance. The lower-right plot contains a graphical representation of the allele dosages and missing data distribution for all markers and individuals. Finally, the bottom-right graphic shows the proportion of redundant markers in the dataset.
If you want to view a specific marker information, use the plot_mrk_info function. You need to indicate your dataset object using the input.data argument, and the desired marker using the mrk argument. You can indicate the marker using its number or its name (string):
plot_mrk_info(input.data = dat, mrk = 1979)
# or
# plot_mrk_info(input.data = dat, mrk = 'solcap_snp_c2_17752')
When applied to a dosage-based dataset, the function shows a figure with the marker name and position in the dataset, allele dosage in parents 1 and 2, proportion of missing data, p-value of the associated \(\chi^2\) test, sequence and position information (when available). The figure also contains a plot with the allele dosage and missing data distribution in the population. When applied to a probability-based dataset, the function outputs the probability threshold and 3D scatter of the probability distribution for each individual. You can also print the absolute frequency of genotypes using
print_mrk(dat, mrks = 'solcap_snp_c2_17752')
#>
#> solcap_snp_c2_17752
#> ----------------------------------
#> dosage P1: 2
#> dosage P2: 2
#> ----
#> dosage distribution
#> 0 1 2 3 4 mis
#> 1 29 83 38 5 0
#> ----
#> expected polysomic segregation
#> 0 1 2 3 4
#> 0.02777778 0.22222222 0.50000000 0.22222222 0.02777778
#> ----------------------------------
4 Filtering and Quality Control
4.1 Missing data filtering
The function filter_missing filters out markers and/or individuals that exceeds a defined threshold of missing data. The argument input.data should contain your dataset object, and you can choose to filter either by ‘marker’ or ‘individual’ using the type argument (string). You can also define the maximum proportion of missing data using the filter.thres argument (ranges from 0 to 1, i.e. a threshold of 0.05 will keep just markers or individuals with less than 5% of missing data). When TRUE (default), the inter argument plots markers or individuals vs. frequency of missing data.
# Filtering dataset by marker
dat <- filter_missing(input.data = dat, type = "marker",
filter.thres = 0.05, inter = TRUE)
# Filtering dataset by individual
dat <- filter_missing(input.data = dat, type = "individual",
filter.thres = 0.025, inter = TRUE)
print(dat)
#> This is an object of class 'mappoly.data'
#> Ploidy level: 4
#> No. individuals: 144
#> No. markers: 5301
#> Missing data under 0.95 prob. threshold: 0.68%
#> Redundant markers: 9%
#>
#> This dataset contains sequence information.
#> ----------
#> No. of markers per dosage combination in both parents:
#> P1 P2 freq
#> 0 1 159
#> 0 2 56
#> 0 3 6
#> 1 0 486
#> 1 1 499
#> 1 2 298
#> 1 3 106
#> 1 4 14
#> 2 0 133
#> 2 1 338
#> 2 2 518
#> 2 3 430
#> 2 4 209
#> 3 0 13
#> 3 1 122
#> 3 2 334
#> 3 3 622
#> 3 4 625
#> 4 1 17
#> 4 2 86
#> 4 3 230
In this dataset, 1230 markers had a proportion of missing data above the defined threshold (0.05), while 12 individuals exceeded the defined threshold. Then we will use the final filtered dataset during the rest of the analysis. Notice that the function read_vcf also provides parameters to filter out markers depending on their average depth and missing data, as well as removes data points that do not reach a minimum pre-defined depth value. Check the read_vcf section for more information.
4.2 Segregation test
Another important point to be considered is the expected marker segregation pattern under Mendelian inheritance. Here we will use the chi-square (\(\chi^2\)) to assess the segregation ditortion. The test matches expected genotype frequencies against observed frequencies and calculates the associated p-value. In order to define the p-value threshold for the tests, we will use the Bonferroni correction, which a good aproximation is obtained using
\[\alpha_{thres} = \frac{\alpha}{\#markers}\]
We will also assume that only random chromosome bivalent pairing occurs and there is no double reduction.
pval.bonf <- 0.05/dat$n.mrk
mrks.chi.filt <- filter_segregation(dat, chisq.pval.thres = pval.bonf, inter = TRUE)
seq.init <- make_seq_mappoly(mrks.chi.filt)
Notice that filter_segregation does not produce a filtered dataset; it just tells you which markers follow the expected Mendelian segregation pattern. To select these markers from your dataset, you may use make_seq_mappoly function
plot(seq.init)
All redundant markers identified during data reading step are stored in the main dataset. The redundant markers are automatically removed and can be added back once the maps are finished using the function update_map (described in details later in this tutorial).
5 Two-point analysis
Once the markers are selected, we need to compute the pairwise recombination fraction between all of them (two-point analysis). The function est_pairwise_rf is used to estimate all the pairwise recombination fractions between markers in the sequence provided. Since the output object is too big to be fully displayed on the screen, MAPpoly shows only a summary. Parallel computation is available and, if you want to use it, you need to define the available number of cores in your machine and also guarantee that you have sufficient RAM memory for that. Remember that it is always good to leave one core available for the system, and be aware that this step will take a while to compute if you have few cores available.
# Defining number of cores
n.cores = parallel::detectCores() - 1
#(~ 9.9 minutes using 23 cores)
all.rf.pairwise <- est_pairwise_rf(input.seq = seq.init, ncpus = n.cores)
#> INFO: Using 23 CPUs for calculation.
#> INFO: Done with 12427605 pairs of markers
#> INFO: Calculation took: 402.477 seconds
all.rf.pairwise
#> This is an object of class 'poly.est.two.pts.pairwise'
#> -----------------------------------------------------
#> No. markers: 4986
#> No. estimated recombination fractions: 11694917 (94.1%)
#> -----------------------------------------------------
To assess the recombination fraction between a particular pair of markers, say markers 2204 and 4508, you can use the following
all.rf.pairwise$pairwise$`2204-4508`
#> LOD_ph rf LOD_rf
#> 1-1 0.000000 0.05688641 1.340883e+00
#> 2-1 -2.753497 0.43080101 1.098301e-01
#> 1-2 -2.753497 0.43080101 1.098301e-01
#> 1-0 -2.863464 0.49995672 1.369849e-04
#> 0-1 -2.863464 0.49995672 1.369849e-04
#> 2-2 -4.727155 0.47577396 6.847633e-02
#> 2-0 -4.795632 0.49993129 8.815329e-08
#> 0-2 -4.795632 0.49993129 8.815329e-08
#> 0-0 -4.795880 0.49995600 2.486217e-04
In this case, 2204-4508 represents the position of the markers in the filtered data set. The name of the rows have the form x-y, where x and y indicate how many homologous chromosomes share the same allelic variant in parents \(P1\) and \(P2\), respectively (see this figure in Mollinari and Garcia (2019) for notation). The first column indicates the LOD Score in relation to the most likely linkage phase configuration. The second column shows the estimated recombination fraction for each configuration, and the third indicates the LOD Score comparing the likelihood under no linkage (\(r = 0.5\)) with the estimated recombination fraction (evidence of linkage). These results can be plotted using
plot(all.rf.pairwise, first.mrk = 2204, second.mrk = 4508)
## Assembling recombination fraction and LOD Score matrices
Recombination fraction and LOD Score matrices are fundamental in genetic mapping. Later in this tutorial, we will use these matrices as the basic information to order markers and also to perform some diagnostics. To convert the two-point object into recombination fraction and LOD Score matrices, we need to assume thresholds for the three columns observed in the previous output. The arguments thresh.LOD.ph and thresh.LOD.rf set LOD Scores thresholds for the second most likely linkage phase configuration and recombination fraction. Here we assume thresh.LOD.ph = 0 and thresh.LOD.rf = 0, thus no matter how likely is the second best option, the function will consider the most likely. The argument thresh.rf = 0.5 indicates that the maximum accepted recombination fraction is 0.5. To convert these values in a recombination fraction matrix, we use the function rf_list_to_matrix.
mat <- rf_list_to_matrix(input.twopt = all.rf.pairwise)
#> INFO: Going singlemode. Using one CPU.
For big datasets, you may use the multi-core support to perform the conversions, using the parameter ncpus to define the number of CPU’s. We can plot the recombination fraction matrix using plot(mat), however, if the markers are not ordered, this command will yield a heat-map with no distinct pattern, which will have almost no use. Further, in this tutorial, we will show how to use the The UPGMA clustering algorithm to form linkage groups and the MDS algorithm to order markers within each group. For now, let us use the chromosomes and marker order from the reference genome using functionget_genomic_order. If the reference order is consistent with the marker order in this specific population, we should observe a block-diagonal matrix and a monotonic pattern within each sub-matrix. Since the recombination fraction matrix dimensions for the whole genome are usually large, it is possible to summarize neighboring cells’ information using a pre-defined grid. The size of the grid is determined by the aggregation factor fact. Using fact = 5, for instance, will average the recombination fractions of cells within a \(5 \times 5\) grid producing the following heatmap
id<-get_genomic_order(seq.init)
s.o <- make_seq_mappoly(id)
plot(mat, ord = s.o$seq.mrk.names, fact = 5)
As expected, we observe the block-diagonal matrix with monotonic patterns. In the previous case, the thresholds allowed to plot almost all points in the recombination fraction matrix. The empty cells in the matrix (if any) indicate marker combinations where it is impossible to detect recombining events using two-point estimates (e.g., between \(1 \times 0\) and \(0 \times 1\) marker). Yet, if the thresholds become more stringent (higher LODs and lower rf), the matrix becomes more sparse.
6 Assembling linkage groups
The function group_mappoly assign markers to linkage groups using the recombination fraction matrix obtained above. The user can provide an expected number of groups or run the interactive version of the function using inter = TRUE. Since in this data set we expect 12 linkage groups (basic chromosome number in potato), we use expected.groups = 12. If the data set provides the chromosome where the markers are located, the function allows comparing the groups obtained using the pairwise recombination fraction and the chromosome information provided using the comp.mat = TRUE.
grs <- group_mappoly(input.mat = mat,
expected.groups = 12,
comp.mat = TRUE,
inter = TRUE)
grs
#> This is an object of class 'mappoly.group'
#> ------------------------------------------
#> Criteria used to assign markers to groups:
#>
#> - Number of markers: 4986
#> - Number of linkage groups: 12
#> - Number of markers per linkage groups:
#> group n.mrk
#> 1 280
#> 2 380
#> 3 466
#> 4 396
#> 5 303
#> 6 449
#> 7 437
#> 8 385
#> 9 499
#> 10 379
#> 11 503
#> 12 509
#> ------------------------------------------
#> 10 9 7 5 12 3 6 11 4 8 1 2 NoChr
#> 1 224 1 3 2 6 3 3 3 9 5 8 6 7
#> 2 1 341 2 1 1 1 1 1 5 4 7 3 12
#> 3 0 17 400 3 5 5 5 3 14 3 7 2 2
#> 4 3 2 9 337 2 5 6 7 3 3 4 3 12
#> 5 1 1 1 6 267 1 2 1 4 6 1 5 7
#> 6 1 6 3 2 6 400 3 2 5 1 6 12 2
#> 7 2 1 0 2 1 1 409 5 3 2 5 1 5
#> 8 4 5 1 3 3 0 2 345 3 2 7 3 7
#> 9 5 4 4 3 1 2 4 3 440 6 3 7 17
#> 10 4 6 1 2 4 1 0 1 6 328 2 8 16
#> 11 2 5 10 1 2 1 2 3 7 2 447 6 15
#> 12 15 4 3 2 6 13 2 6 0 4 16 433 5
#> ------------------------------------------
Here, we have the 4986 markers distributed in 12 linkage groups. The rows indicate linkage groups obtained using linkage information and the columns are the chromosomes in the reference genome. Notice the diagonal indicating the concordance between the two sources of information. Now, we can plot the resulting marker cluster analysis.
plot(grs)
Once the linkage groups are properly assembled, we use the function make_seq_mappoly to make marker sequences from the group analysis. We will assemble a list with 12 positions, each one containing the corresponding linkage group sequence. Also, we will use only markers allocated in the diagonal of the previous comparison matrix. Thus only markers that were assigned to a particular linkage group using both sources of information will be considered. We will also assemble a smaller two-point object using the functions make_pairs_mappoly and rf_snp_filter to facilitate further parallelization procedures.
# genome correspondence
z <- as.numeric(colnames(grs$seq.vs.grouped.snp)[1:12])
LGS<-vector("list", 12)
for(j in 1:12){
temp1 <- make_seq_mappoly(grs, j, genomic.info = 1)
tpt <- make_pairs_mappoly(all.rf.pairwise, input.seq = temp1)
temp2 <- rf_snp_filter(input.twopt = tpt, diagnostic.plot = FALSE)
lgtemp <- get_genomic_order(temp2)
LGS[[z[j]]] <- list(lg = make_seq_mappoly(lgtemp), tpt = tpt)
}
Now, let us print the recombination fraction matrices for each linkage group.
7 Estimating the map for a given order
In this section, we will construct the genetic map for a single linkage group. Later in this tutorial, we will build the linkage map for all 12 potato linkage groups using a parallelized version of the procedures presented here. In the first part of this section, we will construct the map using the marker order provided by the Solanum tuberosum genome version 4.03. In the second part, we will use the MDS algorithm to obtain a de-novo order of the markers based on the recombination fraction matrix. When a reference genome is available but not fully collinear with the mapping population, the user should consider applying both sources of information, as presented in theSupplemental Material of our sweetpotato map.
The estimation of the genetic map for a given order involves the computation of recombination fraction between adjacent markers and inferring the linkage phase configuration of those markers in both parents. The core function to perform these tasks in MAPpoly is est_rf_hmm_sequential. This function uses the pairwise recombination fraction as the first source of information to sequenlially position allelic variants in specific parental homologs. For situations where pairwise analysis has limited power, the algorithm relies on the likelihood obtained through a hidden Markov model (HMM) Mollinari and Garcia (2019). Once all markers are positioned, the final map is reconstructed using the HMM multipoint algorithm.
Example of linkage phase configuration estimation using sequential search space reduction and HMM evaluation.
Several arguments are available to control the inclusion and phasing of the markers in the chain. The argument start.set defines the number of initial markers used to initiate the map construction. The function performs an exhaustive search for the most likely configuration between the initial markers. For the exhaustive search, all marker pairs in the set are considered. A set of linkage phases for each pair is selected using a two-point LOD Score threshold defined by thres.twopt. The LOD Score is always defined in relation to the best linkage phase, i.e., the best linkage phase has LOD Score zero. Configurations with LOD Scores higher than thres.twopt are not considered. Depending on the number of pairwise linkage phases selected, there are multiple competing maps for the initial set. This number is displayed at the beginning of the analysis if verbose=TRUE. These maps are evaluated using the multilocus HMM-based likelihood. In order to extend the map, the function evaluates if there are no distances between neighbor markers bigger than sub.map.size.diff.limit. If so, the first marker is eliminated, and the next marker is included and the exhaustive search is perfomed again. If not, configurations with multilocus LOD Score lower than thres.hmm will be considered in the next round of marker insertion.
After the initial phase, markers are sequentially added to the end of the map in a similar manner as described before: using both thres.twopt and thres.hmm thresholds. However, for the insertion of a single marker at the end of the map, the function only takes into account the two-point information between the inserted marker and the ones already positioned in the map. At this point, if the two-point information is not sufficient to phase the marker under thres.twopt threshold, the function will evaluate it using the multipoint HMM likelihood. Again, configurations with multilocus LOD Score higher than thres.hmm are eliminated, and the remaining ones will be considered in the next round of marker insertion. The argument extend.tail indicates the number of markers that should be considered at the end of the map to compute the multilocus genetic map when inserting a new marker. Ideally, all markers should be considered; however, this strategy can be very computationally demanding, especially for hundreds of markers in high ploidy levels. Usually, for tetraploids, we suggest to set this number from 30 to 100 and for hexaploid from 50 to 200. Higher numbers are more likely to result in better maps, but they can be time prohibitive.
When inserting single markers at the end of the chain, sub.map.size.diff.limit will control if the marker should be positioned based on the expansion of the map caused by the insertion of the marker. If the difference between the previous map and the current map is larger than sub.map.size.diff.limit, the marker is not inserted. Arguments tol and tol.final receive the desired accuracy to estimate the sub-maps during the sequential phasing procedure and the desired accuracy in the final map. The argument phase.number.limit receives the maximum number of linkage phase configurations to be tested using the multiloucs algorithm. info.tail is a logical argument: if TRUE it uses the complete informative tail (last markers in the chain that allow all homologous to be distinguished in the parents) of the chain to calculate the likelihood of the linkage phases.
First, as an example, let us estimate the map for linkage group 10. The values used for all arguments were obtained using a balance of processing speed and accuracy of the algorithm. As an exercise, it is interesting to try different values and check out the results. For now, let us stick with the following values (this step can be time demanding, depending on chosen parameters):
lg10.map <- est_rf_hmm_sequential(input.seq = LGS[[10]]$lg,
start.set = 10,
thres.twopt = 10,
thres.hmm = 10,
extend.tail = 30,
info.tail = TRUE,
twopt = LGS[[10]]$tpt,
sub.map.size.diff.limit = 8,
phase.number.limit = 20,
reestimate.single.ph.configuration = TRUE,
tol = 10e-3,
tol.final = 10e-4)
#> Number of markers: 214
#> ════════════════════════════════════════════════════════════ Initial sequence ══
#> 10 markers...
#> ● Trying sequence: 1 2 3 4 5 6 7 8 9 10 :
#> 3 phase(s): . . .
#> ══════════════════════════════════════════════════ Done with initial sequence ══
#> 11 /11 :(5.1%) 1548: 1 ph (1/1) -- tail: 10 ||●| ||||
#> 12 /12 :(5.6%) 1552: 1 ph (1/1) -- tail: 11 ||●| ||||
#> 13 /13 :(6.1%) 1555: 2 ph (1/2) -- tail: 12 ●●|| ||●| ●|●| ||●|
#> 14 /14 :(6.5%) 1556: 1 ph (1/1) -- tail: 13 ●|●| ||●|
#> 15 /15 :(7%) 831 : 1 ph (1/1) -- tail: 14 ||●| ||||
#> 16 /16 :(7.5%) 832 : 1 ph (1/1) -- tail: 15 ●|●| ||||
#> 17 /17 :(7.9%) 1589: 1 ph (1/1) -- tail: 16 ||●| ||||
#> 18 /18 :(8.4%) 1619: 2 ph (1/2) -- tail: 17 ●●|● ●●●● |●●● ●●●●
#> 19 /19 :(8.9%) 1659: 3 ph (1/3) -- tail: 18 ●||| ●||| ... ●||| ||●|
#> 20 /20 :(9.3%) 3125: 1 ph (1/1) -- tail: 19 ||●| ||||
#> 21 /21 :(9.8%) 3127: 1 ph (1/1) -- tail: 20 |●●● |●●●
#> 22 /22 :(10.3%) 3124: 1 ph (1/1) -- tail: 21 ●●|● ●●●●
#> 23 /23 :(10.7%) 4718: 1 ph (1/1) -- tail: 22 |●●● |●●●
#> 24 /24 :(11.2%) 4717: 2 ph (1/2) -- tail: 23 |●●| |||| ||●● ||||
#> 25 /25 :(11.7%) 3823: 2 ph (1/2) -- tail: 24 ●●|● ●●●● ●|●● ●●●●
#> 26 /26 :(12.1%) 339 : 1 ph (1/1) -- tail: 25 ●●|● ●●●●
#> 27 /27 :(12.6%) 3824: 1 ph (1/1) -- tail: 26 ●||| ●|||
#> 28 /28 :(13.1%) 3822: 1 ph (1/1) -- tail: 27 ||●| ||||
#> 29 /29 :(13.6%) 1277: 1 ph (1/1) -- tail: 28 ●|●● ●●●●
#> 30 /30 :(14%) 2441: 1 ph (1/1) -- tail: 29 ●||● ●●●●
#> 31 /31 :(14.5%) 2442: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 32 /32 :(15%) 2443: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 33 /33 :(15.4%) 1278: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 34 /34 :(15.9%) 1279: 1 ph (1/1) -- tail: 30 ●●|● ●●●●
#> 35 /35 :(16.4%) 2445: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 36 /36 :(16.8%) 1280: 1 ph (1/1) -- tail: 31 |●●| ||||
#> 37 /37 :(17.3%) 1281: 1 ph (1/1) -- tail: 32 ●||● ●●●●
#> 38 /38 :(17.8%) 651 : 4 ph (1/4) -- tail: 33 ●||● ●●●| ... ●||● ●●|●
#> 39 /39 :(18.2%) 308 : 1 ph (1/1) -- tail: 34 |●|| ||||
#> 40 /40 :(18.7%) 3716: 1 ph (1/1) -- tail: 35 ||●| ●|||
#> 41 /41 :(19.2%) 2993: 1 ph (1/1) -- tail: 36 ||●| ●|||
#> 41: not included (map extension)
#> 41 /42 :(19.6%) 2994: 1 ph (1/1) -- tail: 36 ||●| ●|||
#> 42 /43 :(20.1%) 2989: 1 ph (1/1) -- tail: 37 ●●|● |●●●
#> 43 /44 :(20.6%) 2990: 1 ph (1/1) -- tail: 38 ●●|● |●●●
#> 44 /45 :(21%) 1490: 16 ph (1/16) -- tail: 39 ●●●| ●●●| ... ●●●| ●●|●
#> 45 /46 :(21.5%) 4049: 3 ph (1/3) -- tail: 40 ●●|● |●●| ... ●●|● |●|●
#> 46 /47 :(22%) 428 : 1 ph (1/1) -- tail: 30 |||● ||●●
#> 47 /48 :(22.4%) 429 : 1 ph (1/1) -- tail: 30 |||● ||●●
#> 48 /49 :(22.9%) 2075: 1 ph (1/1) -- tail: 30 ||●| ●●||
#> 49 /50 :(23.4%) 2074: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 50 /51 :(23.8%) 4608: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 51 /52 :(24.3%) 2068: 1 ph (1/1) -- tail: 30 ||●| ●●||
#> 52 /53 :(24.8%) 5249: 6 ph (1/6) -- tail: 30 ●●|| ●●●● ... ●|●| ●●●●
#> 53 /54 :(25.2%) 845 : 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 54 /55 :(25.7%) 5250: 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 55 /56 :(26.2%) 844 : 1 ph (1/1) -- tail: 30 ●●|| ||||
#> 56 /57 :(26.6%) 843 : 1 ph (1/1) -- tail: 30 ●●|| ||||
#> 57 /58 :(27.1%) 841 : 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 58 /59 :(27.6%) 3977: 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 59 /60 :(28%) 396 : 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 60 /61 :(28.5%) 395 : 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 61 /62 :(29%) 394 : 1 ph (1/1) -- tail: 30 ●●|| ||||
#> 62 /63 :(29.4%) 3976: 1 ph (1/1) -- tail: 30 ●●|| ||||
#> 63 /64 :(29.9%) 1200: 1 ph (1/1) -- tail: 30 ||●| ●●||
#> 64 /65 :(30.4%) 3322: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 65 /66 :(30.8%) 3095: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 66 /67 :(31.3%) 1 : 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 67 /68 :(31.8%) 5099: 1 ph (1/1) -- tail: 30 ||●| ●●||
#> 68 /69 :(32.2%) 1133: 1 ph (1/1) -- tail: 30 ●●●| ●●||
#> 69 /70 :(32.7%) 2088: 1 ph (1/1) -- tail: 30 ●●|● |●●●
#> 70 /71 :(33.2%) 2089: 2 ph (1/2) -- tail: 31 ●●|| ●||| ●●|| |●||
#> 71 /72 :(33.6%) 2090: 1 ph (1/1) -- tail: 32 ||●| ●|||
#> 72 /73 :(34.1%) 2091: 1 ph (1/1) -- tail: 33 ||●● ●|●●
#> 73 /74 :(34.6%) 3023: 1 ph (1/1) -- tail: 34 ||●| ●●||
#> 74 /75 :(35%) 3022: 1 ph (1/1) -- tail: 35 ●●●| ●●||
#> 75 /76 :(35.5%) 4245: 1 ph (1/1) -- tail: 36 ●●|● ||●●
#> 76 /77 :(36%) 468 : 1 ph (1/1) -- tail: 37 |||● ||●●
#> 77 /78 :(36.4%) 3411: 1 ph (1/1) -- tail: 38 |||● |●●●
#> 78 /79 :(36.9%) 3409: 1 ph (1/1) -- tail: 39 |||● |●●●
#> 79 /80 :(37.4%) 3408: 3 ph (1/3) -- tail: 40 ●●●| ●●|| ... ●●●| ●|●|
#> 80 /81 :(37.9%) 1600: 1 ph (1/1) -- tail: 41 |||● |●●●
#> 81 /82 :(38.3%) 1599: 1 ph (1/1) -- tail: 42 |||● |●●●
#> 82 /83 :(38.8%) 3131: 1 ph (1/1) -- tail: 43 |||| |●||
#> 83 /84 :(39.3%) 4795: 1 ph (1/1) -- tail: 44 |||● |●●●
#> 84 /85 :(39.7%) 4221: 1 ph (1/1) -- tail: 45 ●●●● ●●|●
#> 85 /86 :(40.2%) 3599: 1 ph (1/1) -- tail: 46 ●●●● ●|●●
#> 86 /87 :(40.7%) 3597: 1 ph (1/1) -- tail: 47 ●●●● ●|●●
#> 87 /88 :(41.1%) 3598: 1 ph (1/1) -- tail: 48 |||| |●●|
#> 88 /89 :(41.6%) 250 : 1 ph (1/1) -- tail: 49 ●●●● ●|●●
#> 89 /90 :(42.1%) 249 : 1 ph (1/1) -- tail: 50 ●●●● ●●|●
#> 90 /91 :(42.5%) 248 : 1 ph (1/1) -- tail: 51 |||| |●||
#> 91 /92 :(43%) 247 : 1 ph (1/1) -- tail: 52 ●●●● ●|●●
#> 92 /93 :(43.5%) 246 : 1 ph (1/1) -- tail: 53 ●●●● ●|●●
#> 93 /94 :(43.9%) 245 : 1 ph (1/1) -- tail: 54 |||| |●||
#> 94 /95 :(44.4%) 244 : 1 ph (1/1) -- tail: 55 ●●●● ●|●●
#> 95 /96 :(44.9%) 3596: 1 ph (1/1) -- tail: 56 |||| ||●|
#> 96 /97 :(45.3%) 4871: 4 ph (1/4) -- tail: 57 ●●●| ●●|● ... ●●|● ●●|●
#> 97 /98 :(45.8%) 3627: 1 ph (1/1) -- tail: 30 |●●● ●●|●
#> 98 /99 :(46.3%) 3635: 1 ph (1/1) -- tail: 30 ●||| |●●|
#> 99 /100:(46.7%) 3634: 1 ph (1/1) -- tail: 30 ●||| |●●|
#> 100/101:(47.2%) 3633: 1 ph (1/1) -- tail: 30 |||| |●||
#> 101/102:(47.7%) 3632: 1 ph (1/1) -- tail: 30 |●●● ●||●
#> 102/103:(48.1%) 3631: 1 ph (1/1) -- tail: 30 |●●● ●●|●
#> 103/104:(48.6%) 3630: 1 ph (1/1) -- tail: 30 ●||| |●●|
#> 104/105:(49.1%) 4556: 1 ph (1/1) -- tail: 30 ●||| |●●|
#> 105/106:(49.5%) 4919: 1 ph (1/1) -- tail: 30 |●●● ●||●
#> 106/107:(50%) 3589: 1 ph (1/1) -- tail: 31 ●||| |●●|
#> 107/108:(50.5%) 3588: 1 ph (1/1) -- tail: 32 ●||| |●●|
#> 108/109:(50.9%) 3587: 1 ph (1/1) -- tail: 33 |||| |●||
#> 109/110:(51.4%) 3586: 1 ph (1/1) -- tail: 34 ●||| |●●|
#> 110/111:(51.9%) 3585: 1 ph (1/1) -- tail: 35 ●||| |●●|
#> 111/112:(52.3%) 3584: 1 ph (1/1) -- tail: 36 |||| |●||
#> 112/113:(52.8%) 3583: 1 ph (1/1) -- tail: 37 |||| |●||
#> 113/114:(53.3%) 235 : 1 ph (1/1) -- tail: 38 |●●| ●●||
#> 114/115:(53.7%) 637 : 1 ph (1/1) -- tail: 39 ●||● ||●●
#> 115/116:(54.2%) 3933: 1 ph (1/1) -- tail: 40 |●●| ●●||
#> 116/117:(54.7%) 3932: 1 ph (1/1) -- tail: 41 ●||● ||●●
#> 117/118:(55.1%) 379 : 1 ph (1/1) -- tail: 42 ●||● ||●●
#> 118/119:(55.6%) 3931: 1 ph (1/1) -- tail: 43 ●||● ||●●
#> 119/120:(56.1%) 381 : 1 ph (1/1) -- tail: 44 ●||● ||●●
#> 120/121:(56.5%) 380 : 1 ph (1/1) -- tail: 45 |●●| ●●||
#> 121/122:(57%) 3935: 1 ph (1/1) -- tail: 46 |●●| ●●||
#> 122/123:(57.5%) 3934: 1 ph (1/1) -- tail: 47 |●●| ●●||
#> 123/124:(57.9%) 2478: 1 ph (1/1) -- tail: 48 ●||● ||●●
#> 124/125:(58.4%) 672 : 1 ph (1/1) -- tail: 49 ●||● ||●●
#> 125/126:(58.9%) 2477: 1 ph (1/1) -- tail: 50 ●||● ||●●
#> 126/127:(59.3%) 2476: 1 ph (1/1) -- tail: 51 |●●| ●●||
#> 127/128:(59.8%) 1308: 1 ph (1/1) -- tail: 52 ●●|● ||●●
#> 128/129:(60.3%) 4883: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 129/130:(60.7%) 5019: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 130/131:(61.2%) 179 : 2 ph (1/2) -- tail: 30 |●●| ●●●| |●●| ●●|●
#> 131: not included (map extension)
#> 130/132:(61.7%) 180 : 1 ph (1/1) -- tail: 30 |||| ●●||
#> 131/133:(62.1%) 4630: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 132/134:(62.6%) 4745: 1 ph (1/1) -- tail: 30 ●●|● ||●●
#> 133/135:(63.1%) 2633: 2 ph (1/2) -- tail: 30 ●●|● ||●| ●●|● |||●
#> 134/136:(63.6%) 2635: 1 ph (1/1) -- tail: 30 |||| ●●||
#> 135/137:(64%) 2636: 1 ph (1/1) -- tail: 30 ●●●● ||●●
#> 138: not included (linkage phases)
#> 136/139:(65%) 2638: 3 ph (1/3) -- tail: 30 ||●| ●||| ... ||●| |●||
#> 137/140:(65.4%) 2639: 6 ph (1/6) -- tail: 30 ●●|| ●●●● ... ●|●| ●●●●
#> 138/141:(65.9%) 1460: 1 ph (1/1) -- tail: 30 ||●| ||●|
#> 139/142:(66.4%) 2890: 3 ph (1/3) -- tail: 30 ●●●| ||●| ... ●|●● ||●|
#> 140/143:(66.8%) 3942: 1 ph (1/1) -- tail: 30 ●●|| ||||
#> 141/144:(67.3%) 3941: 1 ph (1/1) -- tail: 30 ||●| ||●|
#> 142/145:(67.8%) 3940: 1 ph (1/1) -- tail: 30 ||●● ●●●●
#> 143/146:(68.2%) 4450: 1 ph (1/1) -- tail: 31 ||●● ||●●
#> 144/147:(68.7%) 4699: 1 ph (1/1) -- tail: 32 ●●●| ●●●|
#> 145/148:(69.2%) 4698: 1 ph (1/1) -- tail: 33 ●●|● ●●|●
#> 146/149:(69.6%) 4697: 1 ph (1/1) -- tail: 34 ||●| ||●|
#> 147/150:(70.1%) 4696: 1 ph (1/1) -- tail: 35 ●●|● |||●
#> 148/151:(70.6%) 4695: 1 ph (1/1) -- tail: 36 ||●| ||●|
#> 149/152:(71%) 4694: 1 ph (1/1) -- tail: 37 ●●|| ||||
#> 150/153:(71.5%) 4693: 1 ph (1/1) -- tail: 38 ||●● ●●●●
#> 151/154:(72%) 4692: 1 ph (1/1) -- tail: 39 ●●|● ●●|●
#> 152/155:(72.4%) 4691: 1 ph (1/1) -- tail: 40 ●●|● ●●|●
#> 153/156:(72.9%) 4432: 1 ph (1/1) -- tail: 41 ||●● ●●●●
#> 154/157:(73.4%) 541 : 1 ph (1/1) -- tail: 42 ||●● ●●●●
#> 155/158:(73.8%) 542 : 1 ph (1/1) -- tail: 43 ●●●| ||●|
#> 156/159:(74.3%) 4441: 1 ph (1/1) -- tail: 44 ●●●| ||●|
#> 157/160:(74.8%) 4442: 1 ph (1/1) -- tail: 45 |||● ●●|●
#> 158/161:(75.2%) 515 : 1 ph (1/1) -- tail: 46 ||●● ●●●●
#> 159/162:(75.7%) 4440: 1 ph (1/1) -- tail: 47 |||● ●●|●
#> 160/163:(76.2%) 36 : 1 ph (1/1) -- tail: 48 |||● |||●
#> 161/164:(76.6%) 37 : 16 ph (1/16) -- tail: 49 ●||| ●||| ... ●||| |●||
#> 162/165:(77.1%) 4862: 1 ph (1/1) -- tail: 30 ●|●● ●|●●
#> 163/166:(77.6%) 151 : 1 ph (1/1) -- tail: 30 ●|●● ●|●●
#> 164/167:(78%) 1046: 2 ph (1/2) -- tail: 30 ●||● ●●|● |●|● ●●|●
#> 165/168:(78.5%) 1048: 1 ph (1/1) -- tail: 30 ●||● ●●|●
#> 166/169:(79%) 1049: 3 ph (1/3) -- tail: 30 ●|●| ●|●● ... ●||● ●|●●
#> 167/170:(79.4%) 1846: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 168/171:(79.9%) 1849: 1 ph (1/1) -- tail: 30 ●||● ●●●●
#> 169/172:(80.4%) 1850: 1 ph (1/1) -- tail: 30 |●●| |●||
#> 170/173:(80.8%) 1852: 1 ph (1/1) -- tail: 30 ●●●● ●●|●
#> 174: not included (linkage phases)
#> 171/175:(81.8%) 4140: 1 ph (1/1) -- tail: 30 ●●●● ●|●●
#> 172/176:(82.2%) 4139: 1 ph (1/1) -- tail: 30 ●●●● ●|●●
#> 177: not included (linkage phases)
#> 173/178:(83.2%) 4137: 1 ph (1/1) -- tail: 30 |●|| ||||
#> 174/179:(83.6%) 4136: 4 ph (1/4) -- tail: 30 ●||| |||| ... |●|| ||||
#> 175/180:(84.1%) 442 : 1 ph (1/1) -- tail: 30 ●●●● ●|●●
#> 176/181:(84.6%) 4144: 1 ph (1/1) -- tail: 30 |●|| ||||
#> 177/182:(85%) 4143: 1 ph (1/1) -- tail: 30 ●|●● ●●●●
#> 178/183:(85.5%) 4142: 1 ph (1/1) -- tail: 30 ●|●● ●●●●
#> 179/184:(86%) 4141: 2 ph (1/2) -- tail: 30 ●●●● ●●|| ●●●● |●|●
#> 180/185:(86.4%) 1423: 4 ph (1/4) -- tail: 30 |●|| ●●●| ... |●|| ●●|●
#> 181/186:(86.9%) 1426: 1 ph (1/1) -- tail: 30 ●●●| ●●●●
#> 182/187:(87.4%) 2785: 1 ph (1/1) -- tail: 30 |●|| ●●●|
#> 183/188:(87.9%) 2786: 3 ph (1/3) -- tail: 30 ●|●| |||● ... ●||● |||●
#> 184/189:(88.3%) 2787: 1 ph (1/1) -- tail: 30 |●|| ●●●|
#> 185/190:(88.8%) 2788: 1 ph (1/1) -- tail: 30 |●|| ●●●|
#> 186/191:(89.3%) 2789: 1 ph (1/1) -- tail: 30 |●|| ●●●|
#> 187/192:(89.7%) 337 : 1 ph (1/1) -- tail: 30 ●●●| ●●●●
#> 188/193:(90.2%) 1221: 1 ph (1/1) -- tail: 30 |●|| ●●●|
#> 189/194:(90.7%) 1222: 1 ph (1/1) -- tail: 30 ●|●● |||●
#> 190/195:(91.1%) 1223: 1 ph (1/1) -- tail: 30 ●|●● |||●
#> 191/196:(91.6%) 1225: 1 ph (1/1) -- tail: 30 ●|●● |||●
#> 192/197:(92.1%) 1226: 1 ph (1/1) -- tail: 30 ●|●● ●|●●
#> 193/198:(92.5%) 2283: 1 ph (1/1) -- tail: 30 |●|● ●●●|
#> 194/199:(93%) 465 : 1 ph (1/1) -- tail: 30 |●|| |●||
#> 195/200:(93.5%) 4227: 1 ph (1/1) -- tail: 30 ●|●| |||●
#> 196/201:(93.9%) 4225: 1 ph (1/1) -- tail: 30 ●|●● ●|●●
#> 197/202:(94.4%) 4843: 1 ph (1/1) -- tail: 30 |●|● ●●●|
#> 198/203:(94.9%) 3904: 1 ph (1/1) -- tail: 30 |●|● ●●●|
#> 199/204:(95.3%) 2716: 1 ph (1/1) -- tail: 30 ●|●● ●●●●
#> 200/205:(95.8%) 2712: 6 ph (1/6) -- tail: 31 ●●|| |||| ... ●|●| ||||
#> 201/206:(96.3%) 2709: 1 ph (1/1) -- tail: 31 |●|| ●●●|
#> 202/207:(96.7%) 2717: 1 ph (1/1) -- tail: 32 ●●|| ●●●|
#> 203/208:(97.2%) 3653: 1 ph (1/1) -- tail: 33 ●|●● ●●●●
#> 204/209:(97.7%) 272 : 1 ph (1/1) -- tail: 34 ●●|| ●●●|
#> 205/210:(98.1%) 273 : 1 ph (1/1) -- tail: 35 |●|| ||||
#> 206/211:(98.6%) 3648: 1 ph (1/1) -- tail: 36 ||●● |||●
#> 207/212:(99.1%) 275 : 1 ph (1/1) -- tail: 37 |●|| ||||
#> 208/213:(99.5%) 3650: 1 ph (1/1) -- tail: 38 ||●● |||●
#> 209/214:(100%) 360 : 1 ph (1/1) -- tail: 39 |●|| ||||
#> ══════════════════════════════════ Reestimating final recombination fractions ══
#> Markers in the initial sequence: 214
#> Mapped markers : 209 (97.7%)
#> ════════════════════════════════════════════════════════════════════════════════
Now, use the functions print and plot to view the results:
print(lg10.map)
#> This is an object of class 'mappoly.map'
#> Ploidy level: 4
#> No. individuals: 144
#> No. markers: 209
#> No. linkage phases: 1
#>
#> ---------------------------------------------
#> Number of linkage phase configurations: 1
#> ---------------------------------------------
#> Linkage phase configuration: 1
#> map length: 129.03
#> log-likelihood: -1564.03
#> LOD: 0
#> ~~~~~~~~~~~~~~~~~~
plot(lg10.map)
plot(mat, ord = lg10.map$info$mrk.names, fact = 3)
In the previous figure, black rectangles indicate the presence of the allelic variant in each one of the four homologous in both parents (a,b,*c and, d in parent 1, and e, f, g, and h in parent 2). One also can print a detailed version of the map and plot a specific map segment using the following code.
print(lg10.map, detailed = TRUE)
plot(lg10.map, left.lim = 30, right.lim = 40, mrk.names = TRUE)
7.1 Reestimating map using genotype probabilities
Though current technologies enabled the genotyping of thousands of SNPs, they are prone to genotyping errors, especially in polyploid species. To address this problem MAPpoly uses prior probability distributions for the marker genotypes (as shown here), allowing the HMM to update their probability. This procedure can be applied using either the probability distribution provided by the genotype calling software or assuming a global genotype error. For a detailed explanation of this procedure, please see Mollinari and Garcia (2019). Briefly, the use of a prior information will update the genotype of the markers based on a global chromosome structure. If the probability didtribution of the genotypes is abailable, one can update the map using
lg10.map.prob <- est_full_hmm_with_prior_prob(input.map = lg10.map)
plot(lg10.map.prob)
If the user wants to assume a global genotyping error, say 5%, they should use
lg10.map.err <- est_full_hmm_with_global_error(input.map = lg10.map, error = 0.05)
plot(lg10.map.err)
With a global genotyping error of 5% the resulting map was smaller than the previous maps. Although a smaller map is usually associated with a “high quality” map, this is just a consequence of a much more critical update prompted by the re-estimation procedure: eliminating spurious crossing over event, allowing clear recombination patterns in haplotype reconstruction. We will show an example in a future section.
7.2 Reinserting redundant markers
As mentioned before, redundant markers are automatically removed during the analysis to reduce computational calculations. However, one may want to see the genetic linkage maps considering all markers in the full dataset. The addition of redundant markers to a map can be done by just calling the function update_map. Here we will show the addition in the previous map:
lg10.map.updated = update_map(lg10.map.err)
#> Updating map 1
lg10.map.err
#> This is an object of class 'mappoly.map'
#> Ploidy level: 4
#> No. individuals: 144
#> No. markers: 209
#> No. linkage phases: 1
#>
#> ---------------------------------------------
#> Number of linkage phase configurations: 1
#> ---------------------------------------------
#> Linkage phase configuration: 1
#> map length: 94.4
#> log-likelihood: -2143.46
#> LOD: 0
#> ~~~~~~~~~~~~~~~~~~
lg10.map.updated
#> This is an object of class 'mappoly.map'
#> Ploidy level: 4
#> No. individuals: 144
#> No. markers: 229
#> No. linkage phases: 1
#>
#> ---------------------------------------------
#> Number of linkage phase configurations: 1
#> ---------------------------------------------
#> Linkage phase configuration: 1
#> map length: 94.4
#> log-likelihood: -2143.46
#> LOD: 0
#> ~~~~~~~~~~~~~~~~~~
In this case, the number of markers was increased by 20, however the length of the map is still the same.
8 Ordering markers using MDS and reestimating the map
So far the map was reestimated using the genomic order. However, for some species a good reference genome is not always avaiable and the markers need to be ordered using an optimization technique. Here, we use the MDS (multidimensional scaling) algorithm, proposed in the context of genetic mapping by Preedy and Hackett (2016). The MDS algorithm requires a recombination fraction matrix, which will be transformed in distances using a mapping function (in MAPpoly’s case the Haldane’s mapping function). First, let us gather the pairwise recombination fractions for all linkage groups:
mt <- lapply(LGS, function(x) rf_list_to_matrix(x$tpt))
Now, for each matrix contained in the object mt, we apply the MDS algorithm:
mds.ord <- lapply(mt, mds_mappoly)
Usually at this point, the user can make use of diagnostic plots to remove markers that are disturbing the ordering procedure. Here we didn’t use that procedure, but we encourage the user to check the example in ?mds_mappoly. Now, let us compare the estimated and the genomic orders (feel free to run the last commented line and see interactive plots):
LGS.mds<-vector("list", 12)
for(j in 1:12){
lgtemp <- make_seq_mappoly(mds.ord[[j]])
LGS.mds[[j]] <- list(lg = lgtemp,
tpt = make_pairs_mappoly(all.rf.pairwise, input.seq = lgtemp))
}
geno.vs.mds <- NULL
for(i in 1:length(LGS.mds)){
geno.vs.mds<-rbind(geno.vs.mds,
data.frame(mrk.names = LGS.mds[[i]]$lg$seq.mrk.names,
mds.pos = seq_along(LGS.mds[[i]]$lg$seq.mrk.names),
genomic.pos = order(LGS.mds[[i]]$lg$sequence.pos),
LG = paste0("LG_", i)))
}
require(ggplot2)
p<-ggplot(geno.vs.mds, aes(genomic.pos, mds.pos)) +
geom_point(alpha = 1/5, aes(colour = LG)) +
facet_wrap(~LG) + xlab("Genome Order") + ylab("Map Order")
p
Although it is possible to observe local inconsistencies, the global diagonal pattern indicates a consistent order for all linkage groups using both approaches. Now, let us build the genetic map of linkage group 10 using the MDS order
lg10.map.mds <- est_rf_hmm_sequential(input.seq = LGS.mds[[10]]$lg,
start.set = 10,
thres.twopt = 10,
thres.hmm = 10,
extend.tail = 30,
info.tail = TRUE,
twopt = LGS.mds[[10]]$tpt,
sub.map.size.diff.limit = 10,
phase.number.limit = 20,
reestimate.single.ph.configuration = TRUE,
tol = 10e-3,
tol.final = 10e-4)
#> ════════════════════════════════════════════════════════════ Initial sequence ══
#> ══════════════════════════════════════════════════ Done with initial sequence ══
#> ══════════════════════════════════ Reestimating final recombination fractions ══
#> ════════════════════════════════════════════════════════════════════════════════
And plot the map:
plot(lg10.map.mds)
It is also possible to compare the maps using both genome-based and MDS-based orders with the function plot_map_list:
plot_map_list(list(genome = lg10.map, mds = lg10.map.mds), col = c("#E69F00", "#56B4E9"), title = "")
The genome-based map included 210 markers while the MDS included 222 markers. The smaller size of the genome-based could be interpreted as a better result. Howvere, since it has less markers, this outcome was expected. To formally compare the maps, let us select the markers present in both maps and reestimate them to compare their likelihoods.
mrks.in.gen<-intersect(lg10.map$maps[[1]]$seq.num, lg10.map.mds$maps[[1]]$seq.num)
mrks.in.mds<-intersect(lg10.map.mds$maps[[1]]$seq.num, lg10.map$maps[[1]]$seq.num)
if(cor(mrks.in.gen, mrks.in.mds) < 0){
mrks.in.mds <- rev(mrks.in.mds)
lg10.map.mds <- rev_map(lg10.map.mds)
}
map.comp.3.gen<-get_submap(input.map = lg10.map, match(mrks.in.gen, lg10.map$maps[[1]]$seq.num), verbose = FALSE)
map.comp.3.mds<-get_submap(input.map = lg10.map.mds, match(mrks.in.mds, lg10.map.mds$maps[[1]]$seq.num), verbose = FALSE)
prob.3.gen<-extract_map(lg10.map)
prob.3.mds<-extract_map(lg10.map.mds)
names(prob.3.gen)<-map.comp.3.gen$maps[[1]]$seq.num
names(prob.3.mds)<-map.comp.3.mds$maps[[1]]$seq.num
matplot(t(data.frame(prob.3.gen,prob.3.mds[names(prob.3.gen)])),
type="b", pch="_", col=1, lty=1, lwd = .5, xlab= "",
ylab="Marker position (cM)", axes = F)
axis(2)
mtext(text = round(map.comp.3.gen$maps[[1]]$loglike,1), side = 1, adj = 0)
mtext(text = round(map.comp.3.mds$maps[[1]]$loglike,1), side = 1, adj = 1)
mtext(text = "Genomic", side = 3, adj = 0)
mtext(text = "MDS", side = 3, adj = 1)
Notice that these maps have the same local inversions shown in the dot plots presented earlier. In this case, the log-likelihood of the genomic order is higher than the one obtained using the MDS order. For this linkage group, we have a statistical support to
9 Parallel map construction
9.1 Using one core by LG
Now, the mapping procedure will be applied to all linkage groups using parallelization. Although users must be encouraged to compare both MDS and genomic orders following the previous example, the genomic order will be considered as an example (remember that this step will take a long time to run):
## ~13.3 min
## Performing parallel computation
my.phase.func<-function(X){
x<-est_rf_hmm_sequential(input.seq = X$lg,
start.set = 10,
thres.twopt = 10,
thres.hmm = 10,
extend.tail = 30,
info.tail = TRUE,
twopt = X$tpt,
sub.map.size.diff.limit = 8,
phase.number.limit = 10,
reestimate.single.ph.configuration = TRUE,
tol = 10e-3,
tol.final = 10e-4)
return(x)
}
cl <- parallel::makeCluster(12)
parallel::clusterEvalQ(cl, require(mappoly))
parallel::clusterExport(cl, "dat")
MAPs <- parallel::parLapply(cl,LGS,my.phase.func)
parallel::stopCluster(cl)
A traditional linkage map plot can be generated including all linkage groups, using the function plot_map_list:
plot_map_list(MAPs, col = "ggstyle")
Following the reconstruction of LG 3 shown before, let us consider a global genotyping error of 5% to reestimate the final maps:
my.error.func<-function(X){
x<-est_full_hmm_with_global_error(input.map = X,
error = 0.05,
tol = 10e-4,
verbose = FALSE)
return(x)
}
cl <- parallel::makeCluster(12)
parallel::clusterEvalQ(cl, require(mappoly))
parallel::clusterExport(cl, "dat")
MAP.err <- parallel::parLapply(cl,MAPs,my.error.func)
parallel::stopCluster(cl)
Comparing both results:
all.MAPs <- NULL
for(i in 1:12)
all.MAPs<-c(all.MAPs, MAPs[i], MAP.err[i])
plot_map_list(map.list = all.MAPs, col = rep(c("#E69F00", "#56B4E9"), 12))
Then, the map that included modeling of genotype errors was chosen as the best one.
plot_map_list(MAP.err, col = "ggstyle")
9.2 Map vs. genome
plot_genome_vs_map(MAP.err, same.ch.lg = TRUE)
9.3 Map summary
After building two or more maps, one may want to compare summary statistics of those maps regarding the same chromosome, or even across chromosomes. A brief comparison can be done using the function summary_maps, which generates a table containing these statistics based on a list of mappoly.map objects:
knitr::kable(summary_maps(MAPs))
#>
#> Your dataset contains removed (redundant) markers. Once finished the maps, remember to add them back with the function 'update_map'.
| LG | Genomic sequence | Map length (cM) | Markers/cM | Simplex | Double-simplex | Multiplex | Total | Max gap |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 233.52 | 1.73 | 97 | 106 | 200 | 403 | 8.32 |
| 2 | 2 | 192.19 | 2.13 | 125 | 131 | 153 | 409 | 3.24 |
| 3 | 3 | 166.62 | 2.29 | 151 | 25 | 205 | 381 | 4.87 |
| 4 | 4 | 222.27 | 1.88 | 111 | 79 | 228 | 418 | 5.69 |
| 5 | 5 | 143.31 | 2.09 | 113 | 53 | 133 | 299 | 6.1 |
| 6 | 6 | 170.31 | 2.25 | 64 | 76 | 244 | 384 | 3.94 |
| 7 | 7 | 130.43 | 2.9 | 138 | 73 | 167 | 378 | 6.13 |
| 8 | 8 | 123.23 | 2.53 | 56 | 95 | 161 | 312 | 5.15 |
| 9 | 9 | 150.22 | 2.12 | 73 | 92 | 154 | 319 | 6.52 |
| 10 | 10 | 127.74 | 1.6 | 53 | 26 | 126 | 205 | 6.17 |
| 11 | 11 | 141.63 | 2.24 | 85 | 47 | 185 | 317 | 9.27 |
| 12 | 12 | 146.56 | 1.7 | 112 | 24 | 113 | 249 | 3.69 |
| Total | NA | 1948.03 | 2.12 | 1178 | 827 | 2069 | 4074 | 5.76 |
9.4 Export map
require(polymapR)
dat1<-export_data_to_polymapR(hexafake)
F1checked <- checkF1(dosage_matrix = dat1,
parent1 = "P1",
parent2 = "P2",
F1 = colnames(dat)[-c(1:2)],
polysomic = TRUE,
disomic = FALSE,
mixed = FALSE,
ploidy = 6)
head(F1checked$checked_F1)
#> m MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_5 F1_6 F1_NA P1 P2
#> 1 1 M_1 NA NA 0 0 0 0 0 0 0 0 1 2
#> 2 2 M_2 NA NA 0 0 0 0 0 0 0 0 1 0
#> 3 3 M_3 NA NA 0 0 0 0 0 0 0 0 2 1
#> 4 4 M_4 NA NA 0 0 0 0 0 0 0 0 1 1
#> 5 5 M_5 NA NA 0 0 0 0 0 0 0 0 1 0
#> 6 6 M_6 NA NA 0 0 0 0 0 0 0 0 1 0
#> bestfit frqInvalid_bestfit Pvalue_bestfit matchParent_bestfit bestParentfit
#> 1 <NA> NA NA NA <NA>
#> 2 <NA> NA NA NA <NA>
#> 3 <NA> NA NA NA <NA>
#> 4 <NA> NA NA NA <NA>
#> 5 <NA> NA NA NA <NA>
#> 6 <NA> NA NA NA <NA>
#> frqInvalid_bestParentfit Pvalue_bestParentfit matchParent_bestParentfit
#> 1 NA NA NA
#> 2 NA NA NA
#> 3 NA NA NA
#> 4 NA NA NA
#> 5 NA NA NA
#> 6 NA NA NA
#> q1_segtypefit q2_parents q3_fracscored qall_mult qall_weights
#> 1 NA NA NA NA NA
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 NA NA NA NA NA
#> 5 NA NA NA NA NA
#> 6 NA NA NA NA NA
10 Genotype conditional probabilities
In order to use the genetic map in QTLpoly, one needs to obtain the conditional probability of all possible 36 genotypes along the 12 linkage groups for all individuals in the full-sib population. Let us use the function calc_genoprob_error, which similarly to est_full_hmm_with_global_error allows the inclusion of a global genotyping error:
genoprob.err <- vector("list", 12)
for(i in 1:12)
genoprob.err[[i]] <- calc_genoprob_error(input.map = MAP.err[[i]], error = 0.05)
Here, a global genotyping error of 5% was used. Each position of any object in the list genoprob.err contains two elements: an array of dimensions \(36 \times number \; of \; markers \times number \; of \; individuals\) and the position of the markers in the maps in centimorgans. Let us display the results for all linkage groups in individual 1:
ind <- 1
op <- par(mfrow = c(3, 4), pty = "s", mar=c(1,1,1,1))
for(i in 1:12)
{
d <- genoprob.err[[i]]$map
image(t(genoprob.err[[i]]$probs[,,ind]),
col=RColorBrewer::brewer.pal(n=9 , name = "YlOrRd"),
axes=FALSE,
xlab = "Markers",
ylab = "",
main = paste("LG", i))
axis(side = 1, at = d/max(d),
labels =rep("", length(d)), las=2)
}
par(op)
In this figure, the x-axis represents the genetic map and the y-axis represents the 36 possible genotypes in the full-sib population. The color scale varies from dark purple (high probabilities) to light yellow (low probabilities). With the conditional probabilities computed, it is possible to use the object genoprob.err alongside with phenotypic data as the input of the software QTLpoly, which is an under development software to map multiple QTLs in full-sib families of outcrossing autopolyploid species.
11 Obtaining individual haplotypes
Once ready, the genotypic conditional probabilities can be used to recover any individual haplotype given the map (details described in Mollinari et al. (2020)). To generate this information, one may use the function calc_homoprob to account for the probabilities of each homologous, in all map positions for each individual. For example, let us view the homologous probabilities for chromosome 1 and individual 10:
homoprobs = calc_homoprob(genoprob.err)
#>
#> Linkage group 1 ...
#> Linkage group 2 ...
#> Linkage group 3 ...
#> Linkage group 4 ...
#> Linkage group 5 ...
#> Linkage group 6 ...
#> Linkage group 7 ...
#> Linkage group 8 ...
#> Linkage group 9 ...
#> Linkage group 10 ...
#> Linkage group 11 ...
#> Linkage group 12 ...
plot(homoprobs, lg = 1, ind = 10)
Using this graphic, it is possible to identify regions of crossing-over occurrence, represented by the inversion of probability magnitudes between homologous from the same parent. It is also possible to view all chromosomes at the same time for any individual by setting the parameter lg = "all". One may use this information to evaluate the quality of the map and repeat some processes with modifications, if necessary.
12 Evaluating the meiotic process
MAPpoly also handles a function to evaluate the meiotic process that guided gamete formation on the studied population. Given genotype conditional probabilities, one may want to account for homologous pairing probabilities and detect the occurrence of preferential pairing, which is possible through the function calc_prefpair_profiles:
prefpairs = calc_prefpair_profiles(genoprob.err)
#>
#> Linkage group 1 ...
#> Linkage group 2 ...
#> Linkage group 3 ...
#> Linkage group 4 ...
#> Linkage group 5 ...
#> Linkage group 6 ...
#> Linkage group 7 ...
#> Linkage group 8 ...
#> Linkage group 9 ...
#> Linkage group 10 ...
#> Linkage group 11 ...
#> Linkage group 12 ...
The function returns an object of class mappoly.prefpair.profiles, which was saved as prefpairs. This object handles all information necessary to study pairing, such as the probability for each pairing configuration (\(\psi\); see Mollinari and Garcia (2019)) inside each parent. For a more user-friendly visualization of the results, one may want to look at the plot output:
plot(prefpairs)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#save.image("all.analysis.rda")
This graphic shows information about all pairing configurations and their probabilities, the proportion of bivalent/multivalent pairing, and also the p-value for preferential pairing test for all markers inside each parent.
References
da Silva Pereira, Guilherme, Dorcus C Gemenet, Marcelo Mollinari, Bode A Olukolu, Joshua C Wood, Federico Diaz, Veronica Mosquera, et al. 2020. “Multiple QTL Mapping in Autopolyploids: A Random-Effect Model Approach with Application in a Hexaploid Sweetpotato Full-Sib Population.” Genetics 215 (3): 579–95. https://doi.org/10.1534/genetics.120.303080.
Mollinari, Marcelo, and Antonio Augusto Franco Garcia. 2019. “Linkage Analysis and Haplotype Phasing in Experimental Autopolyploid Populations with High Ploidy Level Using Hidden Markov Models.” G3 - Genes, Genomes, Genetics 9 (10): 3297–3314. https://doi.org/10.1534/g3.119.400378.
Mollinari, Marcelo, Bode A. Olukolu, Guilherme da S. Pereira, Awais Khan, Dorcus Gemenet, G. Craig Yencho, and Zhao-Bang Zeng. 2020. “Unraveling the Hexaploid Sweetpotato Inheritance Using Ultra-Dense Multilocus Mapping.” G3: Genes|Genomes|Genetics 10 (1): 281–92. https://doi.org/10.1534/g3.119.400620.
Preedy, K. F., and C. A. Hackett. 2016. “A Rapid Marker Ordering Approach for High-Density Genetic Linkage Maps in Experimental Autotetraploid Populations Using Multidimensional Scaling.” Theor. Appl. Genet. 129 (11): 2117–32. https://doi.org/10.1007/s00122-016-2761-8.